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1 Introduction 

The detection of change-points in heterogeneous sequences is a statistical challenge with many 
applications in fields such as finance, reliability, signal analysis, neurosciences and biology. A wide 
variety of literature exists for finding an ideal set of change-points for characterizing the data, in 
terms of both the number of change-points in a given sequence and their corresponding locations. 

A conventional expression of the change-point problem is as follows: given a dataset Xi-n = 
{Xi, X2, ■ ■ ■ T Xn) of real- valued observations, to find an ideal set of K non-overlapping intervals 
where the observations are homogeneous within each. For K segments, the change-point model 
expresses the distribution of X given a segmentation S'i:„ = {Si, . . . , Sn) G -Mk as: 



(1) 



s=l i,Si- 



where /3s is the emission distribution of the observed data in segment s, 9 = {9i, . . . ,0k) is the 
set of model parameters. Si is the segment index at position i, and Aix is the set of all possible 
combinations of S for fixed K ^ 2 number of segments (for example, 5*1:7 = 1122233 corresponds 
to n = 7 observations divided into K = 3 segments with two change-points: the first one between 
positions 2 and 3, and the second one between positions 5 and 6). We refer to this approach to the 
change-point model as segment-based, later we discuss another common approach to change-point 
detection that we refer to as level-based. For simplicity, denote Pe{—) as P(— ) when the set of 
parameters is clear according to the context. 

1.1 Current HMM algorithms in change-point analysis 

The hidden Markov model [HMM, see Rabiner[ |1989| is a commonly used tool for inference in 
change-point analysis. While HMMs are a conventional feature in change-point methods in bioin- 



formatics Fridlyand et al. 


2004 , it is also widely used in diverse research areas such as speech 


recognition Rabiner 


1989 


, facial recognition Nefian and Hayes HI 1998 , financial time series 


Ge and Smyth 


2000 


, inflation models Chopin and Pelgrin 2004 , music classification Kimber 


and Wilcox 1997 


, brain imaging Zhang et al. 2001 , climate research Hughes et al. 1999 , and 


network security 


Cho and Park 2003 . 



Change-point analysis can be seen as a HMM where the data are the observations and the 
unknown segmentation the hidden states. HMM adaptations can therefore identify change-points 
by observations where a switch in hidden states is most likely to occur. A convenient feature of 
the HMM approaches is in inferential procedures, such as estimating the posterior marginal state 
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distribution P(S'|X). An cfBcient computation in linear time of this quantity uses classical forward- 
backward recursions [Durbin et al. 1998 . The HMM estimation in mixture and change-point 



problems can be accomplished through the expectation-maximization (E-M) algorithm Dempster 
et al.[ [1977] |Bilmes[ |1998|, a nd MCMC methods, inclu ding reversibl e jump MCMC |Green[ 11995^ 



Gibbs sampling Chib 1998 , and recursive algorithms Scott 20021. Other approaches which use 



sampling to characterize the uncertainty in a HMM given the data include particle filtering Fearn- 

and MCMC 



head and Clifford 2003 



Guha et al. 



involved in HMM estimation can be found in Cappe et al. 2005 



2008 . A summary of the inferential procedures 
In addition to the algorithm 



of linear complexity presented in this chapter, two other exact algorithms for estimating posterior 
distributions include a frequentist Guedon 2007 and Bayesian Rigaill et al. 2011 approach which 
use modified versions of the forward-backward algorithm. 

Many schemes for estimating the number of hidden states in the HMM have also been inves- 
tigated, which in general include several penalization criteria. These include a modified Bayes 
Information Criterion Zhang and Siegmund 2007 to adjust for the number of states in previously 

for estimating the 



Lavielle 2005 Picard et al. 2005 



fitted HMM as well as adaptive methods 
location and number of change-points. 

Section [2] presents two different frameworks for applying HMM to change-point models, Section 
[3] provides a summary of two procedures for inference in change-point analysis. Section [4] provides 
two examples of the HMM methods on available data sets, Section [5] provides a short summary of 
HMM and other change-point methods for current genomics studies, and Section [6] provides a short 
conclusion and discussion. 



2 The level- and segment- based change-point models as par- 
ticular cases of the HMM formalisms 

We present a unifying HMM approach that can be adapted to many different approaches to the 
change-point problem. The known properties and current algorithms of the HMM allow for the 
efficient estimation of many quantities of interest. In turn, the results from the HMM may be 
used for model selection, or provide additional information about a given segmentation, such as 
the uncertainty of the change-points. The HMM framework also permits various extensions of the 
change-point model for further practical use. 

A typical HMM is defined by the joint probability distribution 

n 

nXi:n,Si.,„)^P{Si)P{X,\Si)llP{S,\S,_i)P{X,\S,) (2) 

where n is the total number of observations, Xi-n = (A"i, . . . , X„) is the vector of all the observation 
variables, S'i:„ — {Si, . . . , Sn) is the vector of all the hidden variables. Figure [l] provides a simple 
example of the dependencies among variables. 

For homogeneous HMMs define P{Si — s) — /i(s), P{Si — s\Si^i = r) = a{r,s) for all i = 
2, . . . , n and P{X, ^ x\S, ^ s; 9) = /^^(x) for alH = 1, . . . , n. 

Given a evidence, or some prior knowledge of the states of some variables, the computation of the 
posterior probabilities of the hidden variables is an important aspect of Hidden Markov modeling. 
For all observations i = l,...,n, we introduce the formal notion of evidence by considering Xi 
and Si, which are subsets of the two sets of all possible outcomes of Xi and Si, respectively. The 
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Figure 1: HMM topology with n = 5. For i = 1...5, Si are the hidden states, and are the 
observed states. 

evidence is 

£ = {Xi G Xi,Si eSi, for alH = l,...,n.} (3) 

This evidence, which can comprise both observed information X.j and/or known prior information 
Si, provides a constraint on the set of posterior distributions. The unconstrained case occurs when 

and 5, both contain the set of all possible states for Xi and Si, respectively, for alH = 1, . . . , n. 

Dcipcnding on the definition of the hidden state variables S, their corresponding their probability 
distributions and the available evidence, the HMM framework allows the definition of level- and 
segment-based models as follows. 

2.1 Level- based model: the standeird evidence 

In the level-based model, the hidden state Si pertains to the level (or underlying distribution) of 
observation Xi. This is an appropriate model when underlying properties can be shared between 
observations in different, non-adjacent segments. This HMM can be defined similarly to the classical 
HMM segmentation models by choosing the finite set oi L ^ 1 levels, with S* G {1, 2, . . . , L}". With 
this level-based approach K ^ L, and transitions are possible between any pair of states. 

Because all observation variables Xi are observed and there are no constraints on the hidden 
states, the evidence is 

£^ = {Xi=Xi,...,X„ = Xn} = {Xi.,„ = Xun}. (4) 

This is the standard evidence usually found in most applications. 

The transition matrix between hidden states can be assumed to be homogeneous, and is often 
parsimoniously parametrized. For example, consider: 

^ir,s) = [ ^^^^ ;[;^;: (5) 

which uses only L free parameters. 

2.2 Segment-based model 

In the segment-based model, the objective is to find the best partitioning S G A4k of the data into 
K non-overlapping intervals, where the hidden state Si pertains to the segment index of observation 
Xi. 
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One set of constraints [Luong et al. 2012 that allows the HMM to correspond exactly to the 
above segment-based model defined in Equation (fTl) is 



£^ ^{Si^l,Sn^K,Xi.,n^Xi.,„} (6) 

where the transition only permits increments of or +1 of the segment index Si, where K is the 
fixed total number of segments. 

In order to obtain a uniform prior distribution of Si;n over all possible segmentations with K 
segments Mk, define the transition matrix over the state space {1, . . . , K, K + 1} (where if -|- 1 is 
a "junk" state) as follows: 

if r ^ if and s = r 
if r ^ K and s = r + 1 
ii s = r ^ K + 1 
otherwise 

where rj e]0, 1[ is a fixed number. Note that the particular value of rj does affect P(5) but has no 
effect whatsoever on P(5|£'^). An arbitrary choice of 77 = 0.5 is sufficient for practical computations. 



a{r, s) 




3 Inference in level- and segment-based HMMs 
3.1 Variable elimination 



The forward-backward algorithm Rabiner 1989 Durbin et al. 1998 efficiently solves many infer 



ence problems in HMMs; before going into its technical details we illustrate the motivation and the 
very simple ideas behind it. 

Let us consider a HMM of length n in which each hidden variable has K possible states. Suppose 
we observe the standard evidence £ = {Xi^n = a^im}- This is the specific expression for evidence Q 
found in the level-based model; similar expressions also apply to the evidence in the segment-based 
model Q and for the most general form of evidence ([3| . 

An important problem in inference is to estimate the probability of this evidence, which can be 
accomplished by summing the joint distribution P(S'i:„, Xi-^ = a^i:n) over all the hidden variables: 

P(Xl:„ = Xl:„) = J2 P{Sl,...,Sn,Xl.,„=Xl.,^). 
Si,...,S„ 

In a naive approach, the first step would be to evaluate the joint probability for each possible value 
of {Si, . . . ,Sn) and then perform the summations explicitly. However this is a highly inefficient 
method as its total cost is 0{K"). 

The exponential blowup is addressed by observing that the factors in the joint distribution ([2| 
depend each on a small number of variables. A much more efficient algorithm is to evaluate 
expressions depending on these factors once and then cache the results, which avoids generating 
them multiple times. The basic tools which accomplish this are the factorization given in Eq. ^ 
and the distributive property. 
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For example, in the case n — S: 



^P(^i)P(Xi =xi|^i)^P(52|^i)P(X2 = a;2|^2)^P(^3|^2)P(^3 = a;3|^3). (7) 



Si 



B2{S2 



Bi(Si) 

In practice, we choose an ordering of the hidden variables (here the backward order S'3 < ^2 < 
5*1) and then rearrange all factors in order, so that all the factors depending on 53 are the furthest 
on the right. Then all remaining factors depending on S2 are placed to the left, and similarly for 
5*1. This makes it possible to eliminate one variable after another, according to the initial order. 
For each eliminated variable, we obtain a quantity which we cache and use in order to eliminate the 
ensuing variable. These quantities are also called messages and the outlined procedure provides a 
recursive method to compute them. In the case of the backward ordering, the messages are called 
backward quantities. 

The computation of B2{S2) for all the values of S2 requires the sum of K terms (one for 
each value of S'3); thus resulting in 0{K^) operations. Similarly, computing Bi requires 0{K^) 
operations. Because there are n hidden variables the resulting total cost is 0{nK^) which is a 
dramatic improvement over the naive 0{K"') complexity. 

A central aspect of variable elimination is the initial ordering of these hidden variables. For 
instance, consider the elimination order 6*2 < ^3 < S*!, then: 

P(Xl:„ = Xl:„) = 

^P(5i)P(Xi - ^P(X3 = X3I53) ;^P(53|52)P(^2|^l)P(^2 - X2\S2) . 

Si S2 

^ ^ 



D(Si) 

The cost of eliminating 5*2 is 0{K^), with K terms summed for each value of the pair (S'3, Si). As 
a result, the resulting complexity is 0{nK^). 

Another ordering which leads to a 0{nK'^) cost is the forward ordering 5i < 6*2 < S'3: 

^P(X3 = a;3|S3)^P(53|S2)P(X2 = X2|S2)^P(52|^i)P(^i = a;i|Si)P(Si) . 

S3 52 Si 



£2(S2) 



£2(82) and -£3(^3) are closely related to the forward quantities defined in Section 3.2.1 having 
defined i^i(Si) := V{Xi = .ti |5i)P(Si): 

E2{S2) =Y.^{S2\Si)Fi{Si), 

Si 
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and 

F2{S2) ■=nX2=X2\S2)E2(S2)=nX2=X2\S2)Y.nS2\Sl)F^{Si) 

Si 

and 

FsiSs) :=P(X3 = X3\S3)E3{S3)^ViX3 = X3\S3)Y,nS3\S2)F2{S2). 

S2 

At this point there are two alternative formulae for solving our initial inference problem based 
on the recursive forward and backward quantities: 

P(^i:„ = a:i:„)=^F3(53), 

S3 

and also, from Eq. ([t]): 

Si 

The next section explains the use of forward and backward quantities to compute the posterior 
probabilities P(S'i|Xi:„ = xi:n) and other useful distributions. For simplicity we will consider 
homogeneous HMMs. 

3.2 Forward-backward algorithm 

3.2.1 Level-based model: the standard forvifard-backward algorithm 

In the level-based case, the evidence Q is the standard evidence found in many HMMs applications. 
The inference problem is then solved by the classical recursive formulae used for most applications. 
Define the forward and backward quantities: 

Ffis) ■.^ViS, = s,Xi.,, = xi,,) 

and 

Bf{s) := P(X,+l:„ = X,+l:„|5, = s), 

for alH e {1, . . . , n} under the convention that Bf^ = 1. These probability distributions are exactly 
the standard forward-backward quantities found in most applications. 

The recursive computation of the forward and backward quantities requires the following results: 

Proposition 1. For alH = 1, . . . ,n and for each value s of the hidden states: 

ViS,^s,£^)^Ffi.s)Bf{s), (8) 

where S'^ is defined in Q. Moreover for all i G {2,...,n}, for each pair (r, s) and for each 
observation x: 

P(5,_i = r,5, = = Ff_,{r)a{r,.s)l3,{x)Bfis) (9) 

Proof. We start by proving Eq. by applying the chain rule and the conditional independence 
of the events {Xi+i.„ = x^+i.,^} and {Xi-i = J given {S^ = s}: 

= S, Xi-i = Xi.i)P(Xj+i.„ — Xi_|_l:„|5i = S, Xi-i — Xi-i) = 

P(S'i = S,Xx:i = a;i:j)P(Xj+i:„ = Xi^X;n\Si = s) . 
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Similarly for Equation ([o]): 

V{Si-i =r,Si^ s,£^) = P(5.,_i = r,Xi:,,_i = xi.,i^i,Si ^ s,Xi = x^.X^+x-.n = a:^i+i:n) = 

□ 

This proposition establishes all the classical results for inference in HMMs. 

Corollary 2 (Forward and backward recursions). The forward quantities can be computed itera- 
tively starting from Ff{s) = ^{s)l3s{xi) for all i = 2, . . . , n with 

w (10) 

The backward quantities can be computed recursively from = 1 for alH = — 1, . . . , 1 with 

Sf_i(r) = ^a(r,s)/3,(x,)i3f (s). (11) 

s 

Proof. In order to obtain the forward Equation (10), apply Equations ([s]) and Q respectively to 
the right and left hand side of 



A similar argument holds for the backward Equation (11). □ 

The following useful result is a straightforward consequence of Proposition [l] 
Corollary 3 (Posterior probabilities). For each i = 1, . . . ,n, the posterior state probabilities are 

where the probability of the evidence is 

P(£^) = 5]^;^(s)i3f(s), 

s 

for an arbitrary fixed i e {1, . . . , n}; in particular ¥'{£'") = ^jf (*)• 

The following corollary is easily proven and makes it possible to sample the joint distribution 
of the hidden variables recursively: 

Corollary 4 (Forward and backward sampling). The joint distribution of S'i:„ conditionally to £^ 
is a Markov chain whose transitions are given by 

P(5.^.|5._,^r,g^)= "^-'-)^-^f5(-) 

in the forward direction, and by 

K is) 

in the backward direction. 
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3.2.2 Forward-backward algorithm for segment-based model 

For the sake of completeness, we present the results of the inference problem for the HMMs con- 
strained by the evidence (|6]). The formulae presented here are easily obtained by modifying the 
standard formulae proved in section |3.2.1| and are particular cases of the forward-backward algo- 
rithm for HMMs conditioned on the general evidence ([3|. In particular, the formulae from the 
previous section still hold for i ^ In these two special cases consider the constraints 5i = 1 

and Sn = K, which are imposed by adding the multiplicative constants or 1{s„=k} to the 

formulaeQ . 

Define the forward quantities as 

Ff{s) := F{Si = 1, = s, Xi;, = XI:,) 

for i < n — 1 and 

F^{s) P(5i = 1, Sn = s, 5„ = K, = x,.,n) = l{,^A-}F(^i = 1, ^„ = s, X,.,,, = x,.,^). 

Define the backward quantities as 

Bf{s) := P(X,+i:„ = a:,+i:„, Sn = K\S, = s), 

for all i G {!,..., n}, with the convention that = 1. The corresponding equations for (|8| 
and ^ become P(S', = s,£^) = Ff{s)Bf{s) and 

P(5,_i = r,S, = = Ff_,{r)air,s)P,{x)Bf{s) 

for all z = 1 , . . . , n — 1 and 

nSn-i=r,Sn = = l{,^K}Ft^ir)air,s)Mx)B^{s). 

The forward quantities can be computed iteratively starting from Ff (s) =^ 'i-{s=i}f^{s)Ps{xi) for 
alH = 2, . . . , n — 1 with 

i^f(s) = ^i^f_i(r)a(r,s)/3,(x,), 

r 

and Fn{s) = 1{s=k} J2r -Fn-i{''^)'^{'''i ^)Ps{xi). The recursions for computing the backward quanti- 
ties are exactly the same as in the level-based setting: 

s 

The preceding constraints result in less computations due to the sparse transition matrix between 
hidden states, as a{r, s) = if s — r ^ (0, 1) leading to a 0{Kn) complexity. In the constrained 
model we note that the probability of the evidence can be computed, for instance, with ¥{£^) = 
Ff{l)Bf{l). 

^ Given an event A, 1^ = 1 iff ^ is true. 
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3.3 Extensions 



Extending the previously described recursive formulae and results to more general evidence forms 
is a straightforward process. For instance, suppose that we observe Xi — Xi for all i except 
for i — 5, which we will consider as missing. In this situation, the evidence is £ — {Xi;4 = 
xi-A^Xg-n — XQ:n}- Define the forward and backward quantities exactly as in the level-based case 
with the only diff'erence that for i > 5, Fi{s) ~ F{Si — s,Xi;4 — xi-A,X^;i — xq-a) and for i < 4 
Bi(s) = P(Xi+i:4 = Xi+i-A, Xg-n — XG;n\Si = 6). In practice, all the results from the level-based 
case still hold if we substitute f3s{x5) by 1 in the formulae expressing ¥{84 — r,S5 — s,£), ^5(5), 
Biir), P(S'5 = s\Si = r,£) and P{Si = r\S5 = s,£). 

Another situation is all variables (Xi,...,X„) being observed with additional partial prior 
knowledge about the hidden variables. Suppose for instance Xi-n = xi-n and, say, 5*7 7^ 2. The 
evidence is £ = {Xi-n = Xl■n,S^ 7^ 2} and we define the forward and backward quantities as 
before with the following exceptions: Fi{s) = f'{Xi-i = xi-i^S-; ^2,Si = s) for i > 7 and Bi{s) = 
i+i = Xi^i, Si 2\Si = s) for i < 7. All the results continue to hold by substituting a{r, s) by 
ls5^2 • a(r, s) in the formulae expressing P{Se ^ r,S7 ~ s,£), Fi{s), B(i{r), P{Sj = s\S(i = r,£) and 
P{SQ = r\ST = s,£). 

The forward-backward algorithm is also known as the product-sum algorithm and is a particular 



case of the message propagation algorithm for Bayesian networks KoUer and Friedman 2009 
As previously seen, its basic mechanism is the simple distributive property of multiplication over 
addition. 

Another common inferential problem in HMMs is in finding a set of variables with the largest 
joint state probability and their corresponding marginal probabilities. In general, the set which 
maximizes the joint state probabilities may not exactly match the set that maximizes each marginal 
probability separately. The max-sum algorithm (also known as the Viterbi algorithm) addresses this 



issue by replacing summations by maximizations in all the above formulae [Viterbi [1967 Rabiner 



1989 . The previous results continue to hold because the multiplication is distributive over the max 



operator: max(a6, ac) = amax(6, c). 
3.4 Floating-point issues 

Underflow is a common issue when using forward and/or backward recursions in floating-point 
arithmetic. Indeed, the magnitude of the forward quantity Fi decreases geometrically with i and 
can be smaller than the smallest machine float (ex: 2.23 x 10"'^°^ for C-I--I- double-precision on 
1686 architecture) which is an architecture-dependent threshold. As suggested in [Rabiner 1989 



pages 272-273], one solution consists in keeping track of a rescaling parameter (typically stored in 
log-scale) for each i. To improve this inefficient approach, in terms of both time and memory, we 
suggest to use log-scale computation through logi^i and log Bi both for level- and segment-based 
models. 

For both expressions of F^ and B^ we recommend the initial precomputation of loga(r, s) and 
log j3s{xi) for all r,s,i. The forward and backward recursions become: 

logF,{s) = logsum[logF,_i(-) +loga{-,s) + log (3 s {x,)] 

logBj_i(r) = logsum[loga(r, •) + logP.{xt) + logBi{-)] 
where logsum is a function of any real vector z — (zi, Z2, ■ ■ ■ z^). For example if zi ^ Z2 ^ ■ ■ ■ Zk 
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and logsum(zi, Z2, ■■ ■,Zk) = logj^j exp(2i): 

logsum(zi, Z2, . . . , Zk) ^ zi + loglp[exp(z2 - zi) + . . . + exp(zfe - zi)] 
where loglp(u) = log(l + u), which is useful in floating-point arithmetic when u is small. 

3.5 E-M algorithm 

Let £ be either £'-' or £^ or a more general evidence. The Expectation-Maximization (EM) algo- 
rithm consists of repeating iteratively the two following steps: 

• i;xpectation (E): compute Q{e'\e) = Y.g¥g{S\£)\ogFg,{S,£); 

• Maximization (M): update 9 with 9 ~ argmaxg/ Q{6'\9). 

The resulting update formulae depend on the model considered (level- or segment-based) and on 
the nature of the emission distribution. In this section, we consider only the two following emission 
models: 

• normal homoscedastic: for all a; G M and hidden state s, /3s (a;) — ip{{x — fis)/<^) where /ig S K, 
cr > 0, with Lp{z) = exp(z2/2)/\/27r for all z e M; 

• Poisson: for all /c € N and hidden state s, Ps{k) — exp(— As)Ag//c! where As > 0. 

During the E-step, forward-backward recursions obtain W'g{S\£) as a heterogeneous Markov 
chain. In practice, it is only necessary to compute Pe (Si |£) — Fi{Si)Bi{Si) /Pe{£) andFg{Si-i, Si\£) - 
F,^iiS,^i)aiS,.i,S,)PsM)Br{X,)/Vgi£). 

For both level- and segment- based model, the M-step is the same for the emission part of 
Q{9'\9): 

~ _i _ELi^W^^^i^ , ~2_ EtiJ:six^~~^srn{s.^.s\£) 

2l^=x^e\S^^ s\£) n 

For the transition part, there is no parameter to update in the segment-based case, and for the 
level-based case (with the transition matrix defined in Eqjs]): 

. ^ J:r^sJ:t2MS^-l^r,S,^s\£) 

Et2MS.^i=r\£) 

3.6 Change-point estimation 
3.6.1 Level-based model 

Change-points are identified based on observations where transitions are most likely to occur. In a 
HMM framework, we assess the probability that i is any change-point between two different hidden 
states, regardless of the ordering within the set of change-points. 



10 



The posterior probability of any change-point occurring at observation i, or such that St ^ -S'j+i, 

is: 



r^s 



i^f(r)a(r,s)/3,(x,+i)i3f,i(s) 



E 



for i = 1, 



,,n- 



1 and r, s = 1 , 



, L, where the last equahty follows from Section 3.2.1 



Unlike the segment-based approach, the level-based approach does not allow to obtain the the 
distribution of the specific r*'' change-point, but only the marginal probability of a change-point 
at a given position. It is however possible to derive the distribution of the r*'* change-point in the 
level-based approach as the waiting time of a regular expression in a heterogeneous Markov chain 



Aston and Martin 2007 . 



3.6.2 Segment-based model 

Define the location of the r*^ change-point as CPr- Under this definition, the posterior probability 
of the r*'' change at observation i is P(CPr = i) = V{Si = r,Si+i = s\£^), where s = r + 1. In 
other words the r*'* change-point is the last observation of segment r before segment r -I- 1, each of 
whom consists of contiguous, homogeneous obser vations. 

Using the formulae provided in Section 3.2.2 the posterior probability of the r*'* change-point 
occurring after observation i, where s = r + 1 is: 

V{CPr - i\£^) = = r, S,+i = s\£^) 

= ¥{S,+i = s\S, = r,£^)¥{S, = r\£^) 
^ Ffir)a{r,r + l)Pr+i{x;+i)Bf+^{r + 1) 
Ff(l)i3f(l) 



for z = 1 , . . . , 71 — 1 and r — 1^ . . . ^ K — 1. 



4 Examples 

4.1 British coal mining disaster data set 



We illustrate the methods on the classical British coal mining disaster data set Carlin et al. 1992 
which displays the number of accidents per year in Great Britain between 1851 and 1962, n — 112. 
The observed count data in the HMM use a Poisson emission distribution. For a change-point 



model with 3 segments, a greedy least squares minimization algorithm Hartigan and Wong 1979 
identified change-points at i = 36 and 97, corresponding to the years 1886 and 1947, respectively. 

The level-based approach models the change-points as transitions between states, which corre- 
spond to mean parameters of the Poisson distribution, or levels. Using these initial change-points 
we obtain maximum likelihood estimates of means A = (3.25, 1.15, 0.27) and the transition proba- 



bilities to be = (1/36, 1/61). Through the forward-backward algorithm described in section 3.2.1 
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Figure 2: Level-based model. Plots of estimated posterior probabilities in British coal mining disaster 
data set Carlin et al. 1992 . Dots are annual accidents with horizontal lines being the maximum 
likelihood estimates of the L — 3 level means, (a) Posterior marginal probability of i being in state 
r, F{Si = r\£^), with solid line being state 1, dashed line state 2 and dashed-dotted line state 3. (b) 
Posterior probability of i being any change-point P(CP = i\£^) = P{Si = r, Si+i = s\£^), where 
r 7^ s. 
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(a) Posterior state probability 
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year 

(b) Posterior change-point probability 



Figure 3: Segment-based model. Plots of estimated posterior probabilities in British coal mining 
disaster data set [Carlin et al. 1992 . Dots are annual accidents with horizontal lines being the 
maximum likelihood estimates of the K = 3 segment means, (a) Posterior marginal probability of i 
being in state r, V{Si = r\£^), with solid line being segment 1, dashed line segment 2 and dashed- 
dotted line segment 3. (b) Posterior probability of i being r*'' change-point P{CPr — i\£^) — 
V{Si = r, Si+i = r + ll^""^), with solid line being change-point 1, dashed line change-point 2. 
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the posterior estimates of the state of observation i given the data are P{Si — r\ £'-') (F igure 2(a) I 
for each of L = 3 levels, and for any change occurring at i P(CP = i\£^) (Figure 2(b) I 



Figures |3(a) and |3(b)| display the posterior probabilities of the marginal distribution ¥{Si = 



r\£^) for each of K = 3 segments and change-point probability F{CPr = ilS^) = ^{Si = r, 5^4-1 — 
r + 11^"^) for if — 1 — 2 change-points, respectively. The segment-based approach models the 
change-points as the beginning and end of 3 intervals of contiguous observations with homogeneous 
distributions. Considering the hidden states of both level and segment -based HMMs are the same, 
the two different approaches produce almost identical respective marginal state and change-point 
curves. 

From a historical perspective it is practical to look for events that may have triggered these 



detected changes in accident rates. In previous change-point literature, Raftery and Akman 1986 
noted that the first change at year 1886 occurred during a decline in labor productivity at the end 
of the 1800s and the emergence of the Miners' Federation. Though, the uncertainty in the first 
change-point location suggests that the reduction in accidents was not due to a sudden event but 
to a continual decline over this time period. Meanwhile, the second change-point at 1947 coincides 
with a more clearly-defined time point that is two years after the end of the second World War. 

4.2 Breast cancer data set 



We apply the methods to a widely referenced data set from a breast cancer cell line BT474 Snijders 



et al. 2001 . The data consist of log-reference ratios (LRRs) signifying the ratio of genomic copies 
of test samples compared to normal. The goal is to segment the data into segments with similar 
copy numbers, with change-points pointing to a copy number aberration that may signify genetic 



mutations of interest [Pinkel et al. 1998 . We use the same data previously analyzed Rigaill 
et al. 2011 , consisting of n = 120 observations from chromosome 10. The observations are sorted 
according to their relative position along chromosome 10. The observed LRR data in the HMM 
uses a normal emission distribution. 

For a change-point model with 4 segments, the least squares algorithm identified the most likely 
change-points a.t i — 68,80 and 96. Under this specific level-based model, the observations in 
segments 1 and 3 may share the same distribution. Using these initial change-points, the maximum 
likelihood estimates are fi — (0.271, —0.039, —0.636) for the 3 levels and the transition probabilities 
to he fj — (2/84, 1/16). We obtain the posterior estimates of the state of observation i given the 



data F{Si — r\£^) (Figure [4(a) ) for L = 3 levels and any change occurring at i F{CP — 
(Figure [itb)]). 



Figures 5(a) and |5(b)| display the posterior probabilities of the marginal distribution P{S; 



r\£ ) br K — A segments and change-point probabilities P{CPr = i\£ ) = ^iSi = r,Si+i — 
r + l\£^) for if — 1 = 3 change-points, respectively. The maximum likelihood estimates of means 
are p, = (0.289,-0.039,0.224,-0.636) for the 4 segments. With different segments 1 and 3 defined 
as homogeneous, the two approaches produce slightly different probability distributions of change- 
point locations, though the locations of peaks remain similar. 

Figure [6] compares the change-point probability curves for both K — 3 and K — A segment-based 
change-point models. The A' = 3 segment-based model does not include the second change-point 
at i = 80. The shape of the posterior probability curve of the first change-point slightly changes 
between the two models, due to the uncertainty in the precise location of this point. On the 
other hand, the peak of the last change-point is close to one, and due to this high precision, the 
corresponding change-point curve from both the K = 3 and K — A change-point models virtually 
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index 

(a) Posterior marginal state probability 
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index 

(b) Posterior change-point probability 



Figure 4: Level-based model. Plots of estimated posterior probabilities in breast-cancer data set 
Dots are LRR with horizontal lines being the maximum likelihood estimates 
3 level means, (a) Posterior marginal probability of i being in state r, V{Si = r\£'-'), with 



Snijders et al. 2001 



of the L 

solid line being state 1, dashed line state 2 and dashed-dotted line state 3. (b) Posterior probability 
of i being any change-point P(CP = = P(S'i = r, S'^+i = s\£^), where r ^ s. 




20 40 60 80 100 120 
index 

(a) Posterior marginal state probability 
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index 



(b) Posterior change-point probability 



Figure 5: Segment-based model. Plots of estimated posterior probabilities in breast-cancer data set 
Snijders et al. 2001 . Dots are LRR with horizontal lines being the maximum likelihood estimates 
of the K = 4 segment means, (a) Posterior marginal probability of i being in state r ¥{Si — r\£^), 
with solid line being segment 1, dashed line segment 2, dashed-dotted line segment 3 and long 
dashed line segment 4. (b) Posterior probability of i being r*'' change-point P{CPr = ^l^^^"^) — 
P{Si — r,Si+i = r + with solid line being change-point 1, dashed line change-point 2, and 

dotted-dashed line change-point 3. 
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Figure 6: Plots of estimated posterior change-point probabilities of K — 3 and K — A segment 
change-point models in breast-cancer data set [Snijders et al. |2001| . Dots are LRR, lines are 
posterior probabilities of a change-point, where V{CPr = = P(>S'i = r, Si+i = r + ll^"^). 

Change-point probability curve is dotted for = 3 segment model and solid for = 4 segment 
model. 
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overlap. 



Current implementations of change-point models for ge- 
nomics data 



A common point of interest in bioinformatics is to find genetic mutations pointing to phenotypes 
susceptible in cancer and other diseases. The detection of change points in Copy Number Variation 
(CNV) is a critical step in the characterization of DNA, including tumoral DNA in cancer. A CNV 
may locate a genetic mutation such as a duplication or deletion in a cancerous cell that is a target 
for treatment. 

Many level-based approaches use the hidden state space in a classical discrete HMM to char- 
to map the number of states and the most likely state 



Fridlyand et al. 2004 



acterize mutations 

at each position. Various extensions to this HMM approach include various procedures such as 



merging change-points and specifying prior transition matrices Willenbrock and Fridlyand 
Marioni et al. 2006| to improve the results 



2005 



Other extensions include reversible-jump Markov 

a continuous-index 



chain Monte Carlo (MCMC) to fit the HMM [Rueda and Diaz-Uriarte| |2007 
HMM 

data, 



Stjernqvist et al. 2007| that takes into account the discrete nature of observed genomics 



sson et al. 2008 



and methods that take into account genomic distance and overlap between clones Ander- 
This HMM approach has also been extended for simultaneous change-points 
and for simultaneously identifying multiple outcomes 



across multiple samples Shah et al. 2009 



Liuetal. 2010 . Other HMM-based implementations Colella et al. 2007 Wang et al. 2007 deal 



with higher-resolution data for current SNP array technologies. 

A wide amount of non-HMM approaches are also available for bioinformatics data. The aim 
of these approaches is typically in finding contiguous segments consisting of observations with the 
same distribution. One such implementation is a non-parametric extension of binary segmentation 



for multiple change-points through permutation Olshen et al. 2004 along with a faster extension 



Venkatraman and Olshen 2007 which uses stopping rules. Various smoothing techniques Hupe 



et al. 2004 Eilers and De Menezes 2005 Hsu et al. 2005 have also been applied to change-point 



modelling. Summaries and comparisons of the various approaches for finding change-points in 



genomics data are available Willenbrock and Fridlyand 2005 , Lai et al. 2005 . 



6 Conclusion 

This chapter describes a simple algorithm using hidden Markov models to estimate posterior dis- 
tributions of interest in change-point analysis, using two different modelling approaches. It also 
addresses computational issues through several simple constraints on the HMM to allow for esti- 
mates in a feasible amount of time. 

HMMs are a special case of Bayesian Networks (BNs) , which are probabilistic graphical models 
that represent random variables and their dependencies via a Directed Acyclic Graph (DAG), for 
example in Figure [l] The conditional dependencies coded by the edges of the DAG determine a 
factorization of the joint probability distribution. Given any type of evidence, summing variables 
out in the joint distribution obtains the conditional distributions of the variables. For BNs, this 
is done in a precise fashion by the exact Belief Propagation (BP) algorithm, which efficiently ap- 
plies the distributive law to provide a recursive decomposition of the initial sums of products. The 
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intermediate sums of products (the so-called messages or beliefs) propagate through a secondary 
tree-shaped structure called a junction tree or graph tree. This general framework allows for recur- 
sive algorithms to obtain various quantities of interest including posterior distribution, likelihood or 
entropy-related terms. Moreover, the recursive algorithms for BNs permit model selection through 
the use of typical criteria such as the Bayesian Information Criterion [BIG, see Schwarz 1978 or 



the Integrated Completed Likelihood [ICL, see Biernacki et al. 2000 



The exact BP general algorithm makes it possible to easily derive forward and backward re- 
cursions for HMMs for any kind of evidence. Moreover the BN unifying framework permits simple 
extensions of the aforementioned change-point HMMs to more complex and realistic models ac- 
counting for intricate dependencies between the variables or data. As an example, a straightforward 
generalization of the aforementioned segment- and level-based models is to account simultaneously 
for both hidden segments and levels. More specifically, let Si and Li be the segment and level, 
respectively, of observation Xi. In this specification, the current level Li not only depends on Li-i, 
but also on both the hidden states Si and 5^-1. 

Another useful extension is a model with joint segmentation of multiple samples. In a two- 
sample situation, there are two sets of distinct observations Xi,. . . ,X„ and Yi, . . . ,Yn, one for 
each sample. Let [/„,...,?/„ be the hidden variables corresponding to the segments of the first 
sample and Vi , . . . , y„ the variables corresponding to the segments of the second sample. The 
model assumes that [/i, . . . , [/„ and Vi, . . . ,Vn both depend on a common segmentation Si, . . . ,Sn- 
An advantage of this model is that the joint segmentation of both samples is not independent and 
can identify common features of both samples. 
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A Sample R code 

The following R code presents a toy example for data simulated from the Poisson distribution, with 
n = 100 observations and J = 3 different segments, with change-points after the observations 25 
and 75. The previously described algorithms calculate the probability of an observation being in a 
hidden state (marginal) and being a change-point (cp). 
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The code for the level-based models: 

n=100; # not usable for large n due to underflow issue 
(logscale version needed for larger n) 
L=3; # number of levels 

lambda = c (4 . 5 , 3 . , 7 . 0) ; # parameters for the three level 
eta=0.03; # transition parameter 

# transition matrix 

pi=matrix(eta/ (L-1) ,ncol=L,nrow=L) ; 
diag(pi)=1.0-eta; 

# generate the data according to the model 
set . seed (42) 

s=rep(NA,n); s[l]=l; 

for (i in 2:n) s [i] = sample ( 1 : L , size = 1 , prob = pi [ s [i - 1] ,]); 
x = rpois(n, lambda [s] ) ; 

ref cp=which (dif f (s) ! =0) ; # reference change-point location 

# forward recursion 
F=matrix(rep(NA,n*L) ,ncol=n) ; 

F[,1]=0.0; F[l,l]=1.0*dpois(x[l] , lambda [1] ) ; 
for (i in 2:n) for (1 in 1:L) 
F [1 , i] =sum (F[,i-l]*pi[,l]*dpois(x[i] , lambda [1] ) ) ; 

# backward recursion 

B=matrix(rep(NA,n*L) ,ncol=n) ; B[,n]=1.0; 

for (i in seq (n , 2 , by=-l) ) for (1 in 1:1) 
B[l,i-l] = sum(pi[l,]*dpois(x[i] , lambda) *B [ , i] ) ; 

# probability of the evidence 
pevidence=F[l,l]*B[l,l]; 

# consistency check TRUE if OK 

# NOT RUN: prod ( apply (F*B ,2 , sum) ==pevidence ) ==1 

# marginal distribution 
marginal=F*B/pevidence ; 

# posterior distribution of change -points 
cp=rep (0 ,n) ; 

for (i in 2:n) for (1 in 1:L) 

cp[i-l]=cp[i-l] +sum(F [l,i-l]*pi[l,-l] *dpois (x [i] , lambda [-l])*B[-l,i]) /pevidence ; 

par(mfrow=c(l ,2)) ; 

plot (x , main= " post erior marginal distribution of segment index ", pch=16 , col= " blue ") ; 

abline (v=ref cp , col =" blue " , lty=4) ; points (lambda [s] , t="l" , col =" blue " , lty=4) ; 

for (j in 1:L) points(max(x)*marginal[j,],col=j,lty=j,t="l",lwd=2); 

plot (x,main=" post erior change -point distribution" ,pch=16,col="blue") ; 

abline (v=ref cp , col="blue " , lty=4) ; 

points (lambda [s] , t = "l" , col = "blue " , lty=4) ; 

points(max(x)*cp,t="l",lwd=2); 
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The code for the segment-based models: 

n=100; # not usable for large n due to underflow issue 
#(logscale version needed for larger n) 
K=3; # number of segments 

lambda = c (4 . 5 , 3 . , 7 . 0) ; # parameters for the three segments 

ref cp = c (25 , 75) ; # ref change point locations 

s=c (rep (1 , 25) , rep (2 , 50) , rep (3 , 25) ) ; # true segment index 

# generate the data according to the model 
set . seed (42) 

x = rpois(ii, lambda [s] ) ; 

# forward recursion 
F=matrix(rep(NA,n*K),ncol=n); 

F[,1]=0.0; F[l , 1] =1 .O*dpois (x [1] , lambda [1] ) ; 
for (i in 2:n) { 

F[2:K,i]=(0.5*F[l:(K-l),i-l]+0.5*F[2:K,i-l])*dpois(x[i], lambda [2 : K] ) ; 
F[l,i]=0.5*F[l,i-l]*dpois(x[i] , lambda [1]); 

}; 

# backward recursion 

B=matrix (rep (NA ,n*K) , ncol=n) ; B[,n]=0.0; B[K,n]=1.0; 
for (i in seq (n , 2 , by = -l) ) {_ 

B[l:(K-l),i-l]=0.5*B[l:(K-l),i]* 

dpois (x[i] , lambda [1 : (K-1) ] ) +0 . 5*B [2 : K , i] *dpois (x[i] , lambda [2: K] ) ; 

B[K,i-l]=0.5*B[K,i]*dpois(x[i] , lambda [K] ) ; 

}; 

# probability of tie evidence 
pevidence=F[l,l]*B[l,l]; 

# consistency check TRUE if OK 

# NOT RUN: prod ( apply (F*B ,2 , sum) ==pevidence ) ==1 

# marginal distribution 
marginal=F*B/pevidence ; 

# posterior distribution of change -points 
cp=matrix (0,nrow=K-l ,ncol=n) ; 

for (k in 1:(K-1)) cp [k , 1 : (n-1 ) ] =F [k , 1 : (n- 1 ) ] /pe vidence * . 5 *B [k + 1 , 2 : n] * 
dpois (x [2 : n] , lambda = lambda [k + 1] ) ; 

par(mfrow=c(l ,2)) ; 

plot (x , main= " post erior marginal distribution of segment index ", pch=16 , col= " blue ") ; 

abline (v=ref cp , col =" blue " , lty=4) ; points (lambda [s] , t="l" , col =" blue " , lty=4) ; 

for (k in 1:K) points(max(x)*marginal[k,],col=k,lty=k,t="l",lwd=2); 

plot (x,main=" post erior change -point distribution" ,pch=16,col="blue") ; 

abline (v=ref cp , col="blue " , lty=4) ; 

points (lambda [s] ,t = "l" ,col = "blue" ,lty=4) ; 

for (k in 1:(K-1)) points (max (x) * cp [k ,], col=k , lty=k , t= " 1 ", lwd=2) ; 
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B Sample datasets 



The number of annual accidents from 1851 — 1962 in the British coal mining disaster data set Carlin 



et al.[|1992| : 
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